Module 5.2: Text as Data
Old Dominion University
Oftentimes, there’s useful information contained in columns of text. Or, perhaps numeric data is stored in a non-numeric format. The easiest example of this is date or time data.
In this portion of the module, we’ll go through ways of working with text and dates.
We are going to focus on two datasets as examples: the New York Knicks roster and Minneapolis police stops.
lubridate
Before we get too far into the weeds of text data, we should introduce a new package called lubridate. This package facilitates working with dates, which are a weird mix of text and numbers.
Let’s go through some of the important functions / capabilities in lubridate.
Note: There is a great cheatsheet available online.
ymd(), dmy(), mdy(): when you read data into R and one of the columns has entries like "January 2 2014", you can use these functions to convert the text into dates. Note that y stands for year, d stands for day, and m stands for month, so choose the function whose letter order matches the order in your data.
Output from these functions appears in year-month-day format, but is actually numeric under the hood: a count of days since January 1, 1970. We can see this by executing as.numeric(mdy("January 2 2014")), which returns 16072.
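As a small sketch of the parsers described above (the input strings are made up for illustration and are not taken from the course data):

```r
library(lubridate)

# Pick the parser whose letters match the order of the pieces in the text.
mdy("January 2 2014")    # month, day, year
dmy("10 August 2015")    # day, month, year
ymd("1978-10-29")        # year, month, day

# Dates are stored as the number of days since 1970-01-01.
d <- mdy("January 2 2014")
class(d)        # "Date"
as.numeric(d)   # 16072

# Text that cannot be read as a date becomes NA (lubridate also warns).
bad <- suppressWarnings(mdy("not a date"))
bad             # NA
```

Checking for NA values after parsing is a quick way to spot rows whose format differs from the rest of the column.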
Once your data are converted to well-behaved date objects, we can begin to manipulate them. We can extract the year, month, day, etc. using lubridate’s helpfully named year(), month(), and day() functions. We can also grab the weekday by using wday(df$date, label = TRUE).
Next, we can use floor_date(df$date, "quarter") to round our dates down to the start of the quarter, month, year, or whatever unit we choose.
Finally, we can calculate differences in dates by using simple subtraction, which returns a number of days. However, oftentimes we want to know the number of, say, months between two dates. For this, we can use the following code: interval(ymd(date1), ymd(date2)) %/% months(1). Note the integer-division operator %/%, which counts whole months.
We’ll go through some examples in a moment.
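Before then, here is a quick sketch of these functions on a made-up two-row data frame (the name df and its dates are hypothetical). It uses integer division, %/%, to count whole months.

```r
library(lubridate)

# Hypothetical data frame with a single date column.
df <- data.frame(date = ymd(c("2022-10-19", "2023-02-14")))

year(df$date)     # 2022 2023
month(df$date)    # 10 2
day(df$date)      # 19 14
wday(df$date, label = TRUE)   # day of the week as a labelled factor

# floor_date() rounds down to the first day of the chosen unit.
floor_date(df$date, "quarter")   # "2022-10-01" "2023-01-01"
floor_date(df$date, "month")     # "2022-10-01" "2023-02-01"

# Plain subtraction gives the gap in days.
days_apart <- df$date[2] - df$date[1]
days_apart                       # Time difference of 118 days

# Whole months between the two dates.
months_apart <- interval(df$date[1], df$date[2]) %/% months(1)
months_apart                     # 3
```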
Date data is a very particular type of text data. However, working with text more generally is equally important when programming. We will not use any new packages for this, as base R has plenty of functions for us.
The first function we’ll take a look at is paste(), which has a sibling function paste0().
paste() takes whatever vectors you give it and smushes them together, element by element. It recycles the shorter vector(s) until they match the length of the longest vector. paste() separates text with spaces by default, but this can be changed via the sep argument. paste0() defaults to sep = "", meaning nothing in between the pieces of text. Some examples are below:
paste("alex", "cardazzi"); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alexander", "alex"), c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex"), "middle-name", c("cardazzi", "trebek")); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton"), sep = "-"); cat("\n")
paste0("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex", "christine"), c("cardazzi", "strong"), sep = " abcxyz ")
[1] "alex cardazzi"
[1] "alex cardazzi" "alex trebek" "alex hamilton"
[1] "alexander cardazzi" "alex trebek" "alexander hamilton"
[1] "alex middle-name cardazzi" "alex middle-name trebek"
[1] "alex-cardazzi" "alex-trebek" "alex-hamilton"
[1] "alexcardazzi" "alextrebek" "alexhamilton"
[1] "alex abcxyz cardazzi" "christine abcxyz strong"
Next, rather than combining character strings, we can break them up. There are a few ways to do this, but let’s work with substr() first. substr() accepts three arguments: x, start, and stop. Effectively, you tell R the “full” character vector and then the start and stop positions of the characters you want. Some examples are below:
char <- c("julius randle", "jalen brunson")
substr(char, 1, 3)
# use `regexpr()` to give you the position of
# a certain sub-string in another string.
substr(char, 1, regexpr(" ", char)); cat("\n")
# use -1 to *not* get the substring you're looking for
substr(char, 1, regexpr(" ", char) - 1); cat("\n")
# when the substring does not appear, regexpr() returns -1,
# so substr() returns an empty string for that element.
substr(char, 1, regexpr("b", char)); cat("\n")
# use nchar() to get the length of the string.
substr(char, regexpr(" ", char), nchar(char)); cat("\n")
# here, use +1 to avoid the space
substr(char, regexpr(" ", char) + 1, nchar(char))
[1] "jul" "jal"
[1] "julius " "jalen "
[1] "julius" "jalen"
[1] "" "jalen b"
[1] " randle" " brunson"
[1] "randle" "brunson"
To check if a certain substring exists in another string, we can use grepl(). This accepts two arguments: a single substring and a vector of other strings. Examples follow:
[1] TRUE TRUE FALSE
[1] TRUE TRUE TRUE
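A minimal sketch of grepl() on a made-up vector (players is a hypothetical name, and these are not necessarily the inputs that produced the output above):

```r
# Hypothetical vector of names.
players <- c("julius randle", "jalen brunson", "josh hart")

# One pattern, checked against every element of the vector.
has_n <- grepl("n", players)
has_n    # TRUE TRUE FALSE

has_j <- grepl("j", players)
has_j    # TRUE TRUE TRUE

# Matching is case sensitive unless ignore.case = TRUE.
grepl("JOSH", players)                       # FALSE FALSE FALSE
grepl("JOSH", players, ignore.case = TRUE)   # FALSE FALSE TRUE

# The logical result can be used directly to subset.
players[has_n]   # "julius randle" "jalen brunson"
```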
Similarly, we can use gsub() to find and replace things.
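As an illustration, here is a sketch that cleans made-up price strings so they can be converted to numbers (prices is a hypothetical vector). The argument fixed = TRUE tells gsub() to treat the pattern as literal text, which matters because "$" has a special meaning in patterns.

```r
# Hypothetical price strings.
prices <- c("$1,200", "$950", "$10,000")

# gsub(pattern, replacement, x) replaces every match in every element.
no_dollar <- gsub("$", "", prices, fixed = TRUE)
no_dollar            # "1,200" "950" "10,000"

no_comma <- gsub(",", "", no_dollar)
price_num <- as.numeric(no_comma)
price_num            # 1200 950 10000

# sub() is the sibling that replaces only the first match.
sub("a", "A", "banana")    # "bAnana"
gsub("a", "A", "banana")   # "bAnAnA"
```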
Working with text patterns can get a bit crazy. There’s a whole mini-language for this that we won’t fully dive into, but I want to expose you to some of it. Here are some important ones:
\\d: this is all digits 0-9. This is very helpful if you want to quickly find and replace all numeric values.
\\D: this is the opposite: all non-digit characters.
[[:alpha:]]: all alphabetic characters.
\\s: space, tab, new line, etc. This is helpful when there’s a bunch of spaces you want to delete.
[[:punct:]]: all punctuation characters.
.: this means any character.
+: means match the previous character at least once. This is helpful when there are many, for example, spaces in a row. You can use gsub("\\s+", "", txt).
^: indicates the beginning of a string.
$: indicates the end of a string.
Below are some examples of how to use regular expressions:
namez <- c("margot elise robbie",
"samuel l jackson",
"jennifer lawrence")
# find space-something-space and replace with nothing.
gsub("\\s.+\\s", "", namez); cat("\n")
# find letters-space and replace with nothing
gsub("[[:alpha:]]+\\s", "", namez); cat("\n")
# find start_of_string-letters-space and replace with nothing
gsub("^[[:alpha:]]+\\s", "", namez); cat("\n")
# find space-letters-end_of_string and replace with nothing
gsub("\\s[[:alpha:]]+$", "", namez)
[1] "margotrobbie" "samueljackson" "jennifer lawrence"
[1] "robbie" "jackson" "lawrence"
[1] "elise robbie" "l jackson" "lawrence"
[1] "margot elise" "samuel l" "jennifer"
Regular expressions are very difficult, but very convenient once you get the hang of them. Just like lubridate, there’s a fantastic cheatsheet online you should check out.
Let’s begin by practicing on the Knicks roster:
    X   No.   Player              Pos  Ht    Wt   Birth.Date         Var.7  Exp  College              bbrefID
1   1   51    Ryan Arcidiacono    PG   6-3   195  March 26, 1994     us     5    Villanova            /players/a/arcidry01.html
2   2   9     RJ Barrett          SG   6-6   214  June 14, 2000      ca     3    Duke                 /players/b/barrerj01.html
3   3   11    Jalen Brunson       PG   6-2   190  August 31, 1996    us     4    Villanova            /players/b/brunsja01.html
4   4   13    Evan Fournier       SG   6-7   205  October 29, 1992   fr     10                        /players/f/fournev01.html
5   5   6     Quentin Grimes      SG   6-5   205  May 8, 2000        us     1    Kansas, Houston      /players/g/grimequ01.html
6   6   3     Josh Hart           SF   6-5   215  March 6, 1995      us     5    Villanova            /players/h/hartjo01.html
7   7   55    Isaiah Hartenstein  C    7-0   250  May 5, 1998        us     4                         /players/h/harteis01.html
8   8   8     DaQuan Jeffries     SG   6-5   230  August 30, 1997    us     3    Oral Roberts, Tulsa  /players/j/jeffrda01.html
9   9   0, 3  Trevor Keels        SG   6-5   221  August 26, 2003    us     R    Duke                 /players/k/keelstr01.html
10  10  2     Miles McBride       PG   6-2   200  September 8, 2000  us     1    West Virginia        /players/m/mcbrimi01.html
11  11  17    Svi Mykhailiuk      SF   6-7   205  June 10, 1997      ua     4    Kansas               /players/m/mykhasv01.html
12  12  5     Immanuel Quickley   SG   6-3   190  June 17, 1999      us     2    Kentucky             /players/q/quickim01.html
13  13  30    Julius Randle       PF   6-8   250  November 29, 1994  us     8    Kentucky             /players/r/randlju01.html
14  14  0     Cam Reddish         SF   6-8   218  September 1, 1999  us     3    Duke                 /players/r/reddica01.html
15  15  23    Mitchell Robinson   C    7-0   240  April 1, 1998      us     4    Western Kentucky     /players/r/robinmi01.html
16  16  4     Derrick Rose        PG   6-3   200  October 4, 1988    us     13   Memphis              /players/r/rosede01.html
17  17  45    Jericho Sims        C    6-10  245  October 20, 1998   us     1    Texas                /players/s/simsje01.html
18  18  1     Obi Toppin          PF   6-9   220  March 4, 1998      us     2    Dayton               /players/t/toppiob01.html
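Let’s calculate the following items: the share of the roster that are guards, the team’s average experience, each player’s height in inches, each player’s age at the start of the season, and each player’s age difference from Jalen Brunson.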
First, let’s examine which players are guards. To do this, we are going to use grepl() to search for "G" in the Pos column. If we find it, we are going to give this player a 1. If we don’t, we are going to give them a 0.
Player Pos guard
1 Ryan Arcidiacono PG 1
2 RJ Barrett SG 1
3 Jalen Brunson PG 1
4 Evan Fournier SG 1
5 Quentin Grimes SG 1
6 Josh Hart SF 0
Percent of the roster that are guards: 56 %
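A minimal sketch of this step, using only five rows copied from the roster above (the data frame name knicks is hypothetical, and the percentage differs from the full-roster figure because this is a subset):

```r
# A few rows copied from the roster table; not the full data set.
knicks <- data.frame(
  Player = c("Ryan Arcidiacono", "RJ Barrett", "Jalen Brunson",
             "Josh Hart", "Julius Randle"),
  Pos    = c("PG", "SG", "PG", "SF", "PF")
)

# grepl() returns TRUE/FALSE; as.numeric() turns that into 1/0.
knicks$guard <- as.numeric(grepl("G", knicks$Pos))
knicks$guard   # 1 1 1 0 0

# The mean of a 0/1 column is the share of ones.
pct_guard <- round(100 * mean(knicks$guard))
cat("Percent of this subset that are guards:", pct_guard, "%\n")
```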
Next, let’s calculate the team’s average experience in the league. However, as you might have noticed, rookie players (meaning players who have never played in the league before) have an “R” for their experience. Because of this, the column is stored as text, and calculating its average returns NA. If we simply drop the “R”/NA values, this will overstate the average experience, since these rookies should have values equal to zero. Let’s replace “R” with 0.
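One way to sketch this, using four Exp values taken from the roster above (Arcidiacono, Barrett, Keels, and Rose) rather than the full column:

```r
# Experience values for four players; "R" marks a rookie.
exp_raw <- c("5", "3", "R", "13")

# Converting directly turns "R" into NA (with a warning).
exp_dropped <- suppressWarnings(as.numeric(exp_raw))
mean(exp_dropped)                 # NA
mean(exp_dropped, na.rm = TRUE)   # 7, which overstates experience

# Replace "R" with "0" first, then convert.
exp_num <- as.numeric(gsub("R", "0", exp_raw))
exp_num         # 5 3 0 13
mean(exp_num)   # 5.25
```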
Next, let’s calculate player heights in inches. To do this, we are going to grab the first part of their height (feet), multiply by 12 to get inches, and then add in the second part of their height (inches).
Player Ht foot inch Ht_inch
1 Ryan Arcidiacono 6-3 6 3 75
2 RJ Barrett 6-6 6 6 78
3 Jalen Brunson 6-2 6 2 74
4 Evan Fournier 6-7 6 7 79
5 Quentin Grimes 6-5 6 5 77
6 Josh Hart 6-5 6 5 77
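A minimal sketch of the height conversion with substr() and regexpr(), using four Ht values from the roster (the variable names are hypothetical). Locating the dash, rather than hard-coding positions, handles heights like "6-10".

```r
# Heights as text, in feet-inches form.
ht <- c("6-3", "6-6", "6-10", "7-0")

# Position of the dash in each string.
dash <- regexpr("-", ht)

# Everything before the dash is feet; everything after is inches.
foot <- as.numeric(substr(ht, 1, dash - 1))
inch <- as.numeric(substr(ht, dash + 1, nchar(ht)))

ht_inch <- 12 * foot + inch
ht_inch   # 75 78 82 84
```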
Now we are going to calculate a player’s age in days at the start of the season. To do this, we have to convert their birthday text into a date object. Then, we are going to subtract the birthdays vector from the season’s start date. This will return the difference in days. We can also take the difference in months (or years, quarters, etc.), as the age_m column below shows.
Player Birth.Date bday age age_m
1 Ryan Arcidiacono March 26, 1994 1994-03-26 10434 342
2 RJ Barrett June 14, 2000 2000-06-14 8162 268
3 Jalen Brunson August 31, 1996 1996-08-31 9545 313
4 Evan Fournier October 29, 1992 1992-10-29 10947 359
5 Quentin Grimes May 8, 2000 2000-05-08 8199 269
6 Josh Hart March 6, 1995 1995-03-06 10089 331
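A sketch of the age calculation for three players from the table. The start date of 2022-10-19 is an assumption: it is not stated in the text, but it is the date that reproduces the ages shown above.

```r
library(lubridate)

# Birth dates as text for Arcidiacono, Barrett, and Hart.
birth <- c("March 26, 1994", "June 14, 2000", "March 6, 1995")

# The text is in month-day-year order, so mdy() is the right parser.
bday <- mdy(birth)

# Assumed season start date (inferred from the table, not given in the text).
season_start <- ymd("2022-10-19")

# Age in days: subtract birthdays from the reference date.
age <- as.numeric(season_start - bday)
age     # 10434 8162 10089

# Age in whole months.
age_m <- interval(bday, season_start) %/% months(1)
age_m   # 342 268 331
```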
Finally, we are going to calculate each player’s difference in age relative to Jalen Brunson. First, we are going to calculate JB’s age, and then subtract his age from everyone else’s age, taking the absolute value so the gap is positive whether a player is older or younger. To finish, we are going to sort the data by this difference.
Player jb_age_diff
3 Jalen Brunson 0
11 Svi Mykhailiuk 283
8 DaQuan Jeffries 364
6 Josh Hart 544
18 Obi Toppin 550
15 Mitchell Robinson 578
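A sketch of this step on four players from the roster. The data frame name knicks and the reference date are assumptions; the reference date cancels out of the difference, so any fixed date gives the same result. abs() is used because the output above shows positive gaps for both older and younger players.

```r
library(lubridate)

# Four rows copied from the roster; not the full data set.
knicks <- data.frame(
  Player     = c("Josh Hart", "Jalen Brunson", "Svi Mykhailiuk", "DaQuan Jeffries"),
  Birth.Date = c("March 6, 1995", "August 31, 1996", "June 10, 1997", "August 30, 1997")
)

# Age in days relative to an assumed reference date.
season_start <- ymd("2022-10-19")
knicks$age <- as.numeric(season_start - mdy(knicks$Birth.Date))

# Pull out Brunson's age, then take the absolute gap for everyone.
jb_age <- knicks$age[knicks$Player == "Jalen Brunson"]
knicks$jb_age_diff <- abs(knicks$age - jb_age)

# order() returns the row order that sorts the column ascending.
result <- knicks[order(knicks$jb_age_diff), c("Player", "jb_age_diff")]
result
```

For these four rows, the sorted gaps are 0, 283, 364, and 544 days, matching the corresponding rows in the output above.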
Next, we are going to explore some crime data. The Minneapolis police record information on stops they conduct such as location and date/time. Let’s generate some numeric data from both of these columns.
First, let’s read in the data and view it.
    id         date                  problem     race     gender   location                  precinct
1   17-036337  2017-02-01T00:00:12Z  traffic     White    Male     44.95134737 -93.28133076  5
2   17-036349  2017-02-01T00:07:38Z  suspicious  Black    Female   44.9474742 -93.29829195   5
3   17-036351  2017-02-01T00:11:39Z  traffic     Other    Male     44.89233 -93.28067        5
4   17-036357  2017-02-01T00:18:50Z  suspicious  Latino   Male     45.01497 -93.24734        2
5   17-036360  2017-02-01T00:22:41Z  traffic     Black    Male     45.00951934 -93.28989378  4
6   17-036378  2017-02-01T00:39:23Z  suspicious  Unknown  Unknown  44.94047824 -93.26763786  3
7   17-036380  2017-02-01T00:39:53Z  suspicious  Black    Male     44.9784072 -93.27881908   1
8   17-036381  2017-02-01T00:40:11Z  traffic     White    Female   44.97429462 -93.27996035  1
9   17-036388  2017-02-01T00:46:54Z  traffic     Black    Female   44.94659 -93.2797         5
10  17-036390  2017-02-01T00:48:37Z  traffic     White    Male     44.97526008 -93.26992525  1
Next, let’s convert the datetime text into something usable with lubridate. Since there’s time information in this, we need to use the still-conveniently named ymd_hms function. Below we’ll generate some distribution plots of the weekday, hour, and minute of these police stops.
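A minimal sketch of the parsing step, using three timestamps copied from the table above. The plotting itself is left out; the counts from table() are what a bar chart of the distribution would display.

```r
library(lubridate)

# Three timestamps copied from the stops data.
stamp <- c("2017-02-01T00:00:12Z", "2017-02-01T00:07:38Z", "2017-02-01T00:46:54Z")

# ymd_hms() reads the date and the time; the result is a date-time in UTC.
dt <- ymd_hms(stamp)

# The same style of accessor functions works on date-times.
hour(dt)     # 0 0 0
minute(dt)   # 0 7 46
second(dt)   # 12 38 54
wday(dt, label = TRUE)   # day of the week for each stop

# Counts per hour, e.g. as input to barplot().
hour_counts <- table(hour(dt))
hour_counts
```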
Finally, we can extract latitude and longitude from the location column, and plot the resulting points as a map.
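A sketch of the extraction, using three location values from the table above. It assumes latitude and longitude are separated by a single space, as they appear in the printed data; the raw file may use a different separator.

```r
# Three location strings copied from the stops data.
# Assumption: "latitude longitude" separated by one space.
loc <- c("44.95134737 -93.28133076",
         "44.9474742 -93.29829195",
         "44.89233 -93.28067")

# Find the space, then split each string around it.
space <- regexpr(" ", loc)
lat <- as.numeric(substr(loc, 1, space - 1))
lon <- as.numeric(substr(loc, space + 1, nchar(loc)))

lat   # 44.95135 44.94747 44.89233
lon   # -93.28133 -93.29829 -93.28067

# Longitude on the x-axis and latitude on the y-axis gives a rough map.
plot(lon, lat, xlab = "Longitude", ylab = "Latitude")
```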
There’s a whole category of data that we have not talked about yet: spatial data. Here, I am going to plot Minneapolis neighborhoods with the stops data on top. There are many things you can do with spatial data, but it is outside the scope of this course.
ECON 311: Economics, Causality, and Analytics